The hepadnavirus encapsidation signal, epsilon (ε), is an RNA structure
located at the 5' end of the viral pregenomic RNA. It is essential for
viral replication and functions in polymerase protein binding and priming.
This structure could also have regulatory roles in controlling the
expression of viral replicative proteins. In addition to its structure, the
primary sequence of this RNA element has crucial functional roles in the
viral lifecycle. Although the ε elements in hepadnaviruses share common
critical functions, there are some significant differences between
mammalian and avian hepadnaviruses, which include both sequence and
structural variations.

Here we present several covariance models for ε elements from the
Hepadnaviridae. The model building included experimentally determined data
from previous studies using chemical probing and NMR analysis. These models
have sufficient similarity to comprise a clan. The clan has in common a
highly conserved overall structure consisting of a lower stem, bulge, upper
stem and apical loop.

The models differ in functionally critical regions. Notably, the two types
of avian ε elements have a tetra-loop (UGUU) including a non-canonical UU
base pair, while the hepatitis B virus (HBV) epsilon has a tri-loop (UGU).
The avian epsilon elements have a less stable, dynamic structure in the
upper stem. Comparisons between these models and all other Rfam models, and
searches of genomes, showed that these structures are specific to the
Hepadnaviridae. Two family models and the clan are available from the Rfam
database.



Hepatitis B Virus 

The human hepatitis B virus (HBV) is a major health problem worldwide, with
an estimated 370 million individuals chronically infected. Chronically
infected patients have an increased risk of developing liver cirrhosis and
liver cancer, resulting in over a million deaths annually. 1,2

HBV is a member of the Hepadnaviridae, 
a family of small hepatotropic DNA 
viruses. Hepadnaviruses are known to 
infect certain mammals (orthohepadna- 
virus) and birds (avihepadnavirus). These 
viruses have a unique replication lifecycle 
in that their partially double-stranded 
DNA genomes are replicated through an 
RNA intermediate, the pregenomic RNA 
(pgRNA). 3 Hepadnaviruses are related to 
retroviruses in that they are both retro- 
transcribing viruses and share some 
general characteristics. 

Current antiviral drugs such as interferon α and nucleoside analogs, while
effective in some cases, have problems of limited efficacy and viral
resistance after prolonged treatment. 4,5 A better understanding of the
interactions between viral and host factors is therefore necessary to
facilitate novel antiviral drugs and strategies. A key cis-acting RNA
element that acts at several steps in the process is the epsilon (ε)
encapsidation signal.

The Structure and Location of
ε Elements in Hepadnavirus RNAs

The pgRNA also serves as the mRNA template for the translation of the
replicative proteins, the core and polymerase (P) proteins. 6-10 The pgRNA
is one of two greater-than-genome-length mRNAs transcribed from the viral
genome. The other is the precore RNA (pcRNA), from which the precore
protein is translated 11,12 (Fig. 1).

The Functions of the ε Element
in Reverse Transcription
and Replication

In hepadnaviruses, the processes of reverse transcription and encapsidation
of the pgRNA are facilitated by the ε encapsidation signal. The ε element
spans a region of approximately 60 nucleotides and is located at both the
5' and 3' ends of the pgRNA and pcRNA. While both the pgRNA and pcRNA are
translated, only the pgRNA is reverse transcribed and encapsidated.
Efficient translation of the precore protein across the pcRNA ε element
melts the RNA structure and prevents pcRNA encapsidation. 13-15
Furthermore, only the 5' ε of the pgRNA has been shown to be essential for
these processes, whereas the 3' ε, which has a slightly different
conformation, is not used. 16

During reverse transcription, the 5' ε element recruits the P protein to
the upper stem, then the TP domain of P initiates priming at the conserved
bulged UUCA and the synthesis of the minus-strand DNA (Fig. 2). This
process involves conformational changes in both the ε structure and the
bound P protein, which open up the base pairing in the upper stem, allowing
reverse transcription from the bulge. 18,19 These conformational changes
and the recruitment of P protein are also facilitated by cellular
chaperones. 17,19-21

Most details of the encapsidation process for hepadnaviruses were
determined from in vitro studies of the avian duck hepatitis B virus
(DHBV). 13,20,22,23 In these studies, the sequence and structure of the
upper stem and bulge, and also sequences in the upper region of the lower
stem, were shown to be important for both P protein binding and
encapsidation. 13,24,25 Interestingly, despite its secondary structure, the
ε region has been targeted effectively by RNAi. 26

There are significant variations between the different members of the
hepadnaviruses. These include notable primary sequence differences within
the ε element between the avian and mammalian hepadnaviruses. There are
also distinct differences in the binding requirements for P protein at the
upper stem, which is less well base paired in most avian hepadnaviruses
(except some DHBV, Fig. 2). In addition, the initiation of DNA synthesis
that has been demonstrated in the DHBV system in vitro has so far not been
shown for HBV, indicating significant differences in the elements.

This study aims to build covariance models of hepadnavirus ε elements that
will uniquely identify them. These models can be used to investigate the
similarities among the ε models themselves and between them and other
known, or previously undetected, RNA structural elements.



Results 

Generation of covariance models for hepadnavirus ε elements. The ε element
is well conserved in overall structure between the mammalian and avian
hepadnaviruses, despite the viruses having significant genome divergence
and differing in the presence or absence of other cis-regulatory
elements. 27

The hepadnavirus ε elements share common structural features, namely (1)
the lower stem, (2) the central bulge and (3) the upper stem with the
apical loop, where the P protein binds (Fig. 2). 20 The secondary
structures of the HBV ε and DHBV ε have extensive base pairing in both
stems, while a more open structure (with reduced base pairing) in the upper
stem region was observed for the HHBV ε (Fig. 2). Despite sharing
similarities with the HBV ε at the base-paired upper stem, the DHBV ε had a
less stable (thermally unstable) upper stem than initially believed and
could potentially assume an open structure similar to the HHBV ε under
physiological conditions. 20 Both the DHBV ε and HHBV ε have a stable
tetra-loop (UGUU) including a non-canonical UU base pair in the apical
loop, while the HBV ε has a tri-loop (UGU).

The sequences of the ε element from representative mammalian and avian
hepadnaviruses were extracted from public DNA databases (Methods). The
sequences were chosen to represent the diversity




Figure 1. A schematic representation of the greater-than-genome-length HBV pgRNA and pcRNA. Cis-acting RNA elements are shown, namely epsilon (ε), direct repeat 1
and direct repeat 2 (DR1 and DR2). The ε structure is present at both the 5' and 3' termini of the pgRNA, but only the 5' ε of the pgRNA is selectively
recognized for packaging. It facilitates polymerase (P) binding, as depicted by the terminal protein (TP) domain and reverse transcriptase (RT) domain. The TP
domain initiates protein priming at the bulge of the 5' ε and after initial priming translocates to the 3' end DR1 acceptor site, where complementary
base pairing to the ε donor allows the RT to initiate minus-strand DNA synthesis. The pcRNA is exactly the same as the pgRNA except for a longer 5'
leader. It encodes the precore ORF and also contains the ε element but, owing to translation of the precore, does not function in encapsidation.
Figure 2. The secondary structure of the HBV, DHBV and HHBV ε elements (derived from NMR, structural probing and functional studies). The ε structure is
remarkably conserved throughout the hepadnaviruses. It features two stem-loop structures, a conserved central bulge and an apical loop: a tri-loop
in human HBV (A) and a tetra-loop in duck (B) and heron (C) hepadnaviruses. The core (C) start is included within the ε in HBV and follows it in avian
viruses. Additional short upstream ORFs are also found (CO, uORF1, uORF2). The CO ORF spans the ε structure within the orthohepadnaviruses, as
represented in HBV. Avian HBVs have two similar short conserved uORFs (uORF1, uORF2), which start near the end of the ε structure. This figure is adapted
with permission from Beck J, et al. 17 Also shown are the associated interacting factors involved in the encapsidation process. Domains of the P protein are
abbreviated as follows: terminal protein (TP), RNase H domain (RH), reverse transcriptase domain (RT). Open circles represent cellular chaperones, which are also
essential in assisting P protein to bind to the ε. The HBV sequence and numbering are according to DDBJ accession number AB037684; the number 35
corresponds to nt number 1850 in the ayw subtype. The sequence of the DHBV is from K01834 and that of the HHBV is from M22056.



of HBV genotypes (A-H) in a reference alignment used in Panjaworayan
et al. 28 Although the genotypes differ by over 8% in sequence overall, the
ε element is highly conserved due to its multiple functions. The secondary
structure is conserved in all 32 members of the reference alignment, except
for an A-G mismatch in the middle of the lower stem in all four genotype A
viruses (orange in Fig. 3A). This mismatch is unexpected, but non-canonical
A-G base pairs can be accommodated with some
Figure 3. Alignments of families of epsilon elements: HBV (A), DHBV (B),
HHBV (C) and a combined avian model, AHBV (D). The SS line represents the
consensus structure in dot-bracket notation: dots are unpaired, brackets
are paired. In A-C, compensating base changes are depicted in green and
base pairs incompatible with the consensus structure in orange. In the
combined model D, blue shading represents compatibility with the structure
line (SS_cons). Stem (Sm) loops and bulges are indicated. These Stockholm
format files and models are available in the Supplementary Material.
distortion within an A-helix. 29 Alternatively, the mismatch may indicate
structural tolerance at this position. The ε elements of the closely
related orthohepadnaviruses (ground squirrel and woodchuck hepatitis
viruses) have an inserted C after this point, also indicating tolerance
(Methods).

Some sequences show compensating base changes within the structure (green
in Fig. 3A). These changes give independent support for the existence of
these base pairs. The other orthohepadnaviruses also have two of these
compensating base changes, but show no changes beyond those seen in the
HBV alignment. Notably, one HBV genotype C sequence (AB048704) has a
compensating G-U closing pair adjacent to the apical tri-loop, providing
additional covariance support for this pair, which was previously observed
in the NMR structure. 30
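
The way covariation supports a base pair can be illustrated with a minimal Python sketch. It is not the procedure used to curate the alignments here: the function names, the toy hairpin and its sequences are hypothetical, and a "compensating change" is assumed to mean a pair that differs from the reference at one or both positions while remaining canonical (Watson-Crick or G-U). Gapped alignment columns are not handled.

```python
# Watson-Crick pairs plus the G-U wobble pair are treated as canonical here.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def parse_pairs(structure):
    """Return sorted (i, j) index pairs for the brackets of a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def compensating_changes(reference, variant, structure):
    """List pairs that differ from the reference but remain canonical."""
    changes = []
    for i, j in parse_pairs(structure):
        ref_pair = reference[i] + reference[j]
        var_pair = variant[i] + variant[j]
        if var_pair != ref_pair and var_pair in CANONICAL:
            changes.append((i, j, ref_pair, var_pair))
    return changes


def incompatible_pairs(sequence, structure):
    """List structure pairs whose bases cannot form a canonical pair."""
    return [
        (i, j, sequence[i] + sequence[j])
        for i, j in parse_pairs(structure)
        if sequence[i] + sequence[j] not in CANONICAL
    ]


# Toy hairpin (hypothetical sequences, not taken from any epsilon alignment).
STRUCTURE = "((((...))))"
REFERENCE = "GGCAUUUUGCC"
print(compensating_changes(REFERENCE, "GACAUUUUGUC", STRUCTURE))
print(incompatible_pairs("GGCAUUUUGAC", STRUCTURE))
```

In this toy case the first variant replaces a G-C pair with A-U (the kind of change shaded green in Fig. 3), whereas the second introduces a G-A mismatch (the kind shaded orange).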

A multiple alignment was assembled and manually refined by structure and
sequence conservation to form a curated seed alignment (Fig. 3). Alignments
of the four elements, HBV ε, DHBV ε, HHBV ε and a combination of the latter
two, avian HBV epsilon (AHBV ε), are shown in Figure 3A-D and are available
in supplementary files. HBV_epsilon (RF01407) and AHBV_epsilon (RF01313)
are also available through Rfam with corresponding Wikipedia entries.

Because of the significant differences in sequence and function between the
ε elements of mammalian, heron and duck hepadnaviruses, alignments were
initially made for each and separate covariance models were built for each
family (Fig. 3B, C). For DHBV there are several Chinese isolates in which
bases at positions in the upper stem are incompatible with a canonical
structure (e.g., C-C, C-U; AY433937 China_GD2, orange in Fig. 3B). This
observation supports the notion that this helix may be unstable in DHBV,
similar to HHBV (Fig. 3C). 25 This contrasts with solution structures of a
South African isolate of DHBV (e.g., AY250904, Fig. 3B) that show the upper
stem extensively paired in an isolated RNA. 31 A combined avian model
(Fig. 3D) with less pairing in the upper stem (similar to HHBV, C) permits
most pairs to be compatible (blue shading in Fig. 3D). These four models
were used for further analysis.

Comparison of the ε Models
to Each Other and to All Other
Rfam Families

The four covariance models were compared with each other using CMCompare.
In a comparison of related and unrelated Rfam models, a score of over 20,
or E < 1.0, was considered worthy of note. However, about 7.4% of pairwise
comparisons of Rfam models had scores over 20, and 6.3% over 28. 32 The
HHBV and DHBV models are most similar (Score 48, Fig. 4), with HBV and DHBV
less similar (Score 28). The combined avian model (AHBV, left in Fig. 4)
showed greater similarity (Score 54) to the HBV model than either avian
model alone (Scores 28, 10). The maximum possible scores for these models,
those for matches to themselves, are shown in Figure 4 (Scores 84, 271, 85,
88). The elements have sufficient overall functional and structural
similarity to form a new clan of Rfam models.
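
As a small worked illustration of the threshold described above, the pairwise scores quoted in this paragraph can be tabulated and filtered in Python. This is only a sketch: the dictionary layout and function name are hypothetical, and reading "Scores 28, 10" as DHBV and HHBV (in that order) against HBV is an assumption.

```python
NOTABLE_SCORE = 20  # "a score of over 20" is treated as worthy of note

# Pairwise CMCompare scores as quoted in the text. Assigning 28 to DHBV and
# 10 to HHBV (each against HBV) is an assumption about the quoted order.
PAIRWISE_SCORES = {
    ("HHBV", "DHBV"): 48,
    ("HBV", "DHBV"): 28,
    ("HBV", "HHBV"): 10,
    ("AHBV", "HBV"): 54,
}


def notable_pairs(scores, threshold=NOTABLE_SCORE):
    """Return (model_a, model_b, score) tuples above the threshold, best first."""
    hits = [(a, b, score) for (a, b), score in scores.items() if score > threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


for model_a, model_b, score in notable_pairs(PAIRWISE_SCORES):
    print(f"{model_a} vs {model_b}: {score}")
```

Under that reading, three of the four comparisons exceed the threshold, and only the HBV and HHBV models fall below it.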

CMCompare was also used to compare the ε models to all other families in
Rfam. Weak matches with scores of 20-26 were seen with many other elements,
generally matching a stem-loop subregion of the alignment. The best matches
were between MicC non-coding RNA (RF00121) and HBV ε (Score 26), and
between Equine arteritis virus (EAV) leader TRS hairpin (LTH) (RF00498) and
DHBV ε (Score 21). Interestingly, the EAV hairpin has a role in
minus-strand RNA synthesis in that RNA virus. 33

In general, however, the new ε models are not structurally similar to
functionally related replication elements of other viruses (scores <15),
for example, those of Hepatitis C virus or retroviruses (HIV-1 DIS,
RF0015). There are nine replication elements in Rfam from human viruses
(Entero_CRE (RF00048), Entero_5_CRE, Flavi_CRE (RF00185), HepC_CRE
(RF00260), Cardiovirus CRE, Rota_CRE) and plant viruses (Tombus_CRE
(RF00510), CTV_rep_sig (RF00193)). Although all these replication elements
form at least a stem-loop structure, they are structurally distinct from
the clan proposed here, so are not included as part of the clan.

Searching Sequence Databases 
for Similar Elements 

These four covariance models (Figs. 3 and 4) were calibrated (using
cmcalibrate) and used to search both strands of all the curated RefSeq
viral genomes, the viral division of GenBank and RFamSeq10 using cmsearch
(Methods). Cmsearch reports a bit score based on the match of the model to
the sequence. It also provides an E value (which corresponds to the
expected number of false positives in a database of this size). Hits with
E values of <0.1 are considered trustworthy. 34
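
The E value criterion above amounts to a simple filter over search hits, sketched here in Python. The `Hit` record, its field names and the example values are hypothetical placeholders chosen for illustration; this is not a parser for cmsearch output, whose format depends on the Infernal version.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.1  # hits with E < 0.1 are treated as trustworthy


@dataclass(frozen=True)
class Hit:
    """One search hit; the field names are illustrative, not a cmsearch format."""
    model: str
    target: str
    bit_score: float
    e_value: float


def trustworthy(hits, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose E value is below the cutoff, lowest E value first."""
    return sorted((h for h in hits if h.e_value < cutoff), key=lambda h: h.e_value)


# Placeholder records: one strong hit and two marginal ones.
HITS = [
    Hit("model_A", "target_1", 60.0, 1e-10),
    Hit("model_A", "target_2", 24.0, 0.43),
    Hit("model_B", "target_3", 22.0, 0.92),
]

for hit in trustworthy(HITS):
    print(hit.target, hit.bit_score, hit.e_value)
```

Because the E value scales with database size, the same bit score can pass the cutoff in a small database and fail it in a large one, which is why the filter is applied to E values rather than to bit scores.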

The HBV ε model was built from sequences representing the diversity of the
common HBV genotypes. It has significant matches to 6910 sequences in the
RFamSeq10 database, all of which are from mammalian HBVs. These matches
represent the diversity of HBV genotypes (A-H). The search identified some
additional mammalian HBV viruses, e.g., woodchuck HBV. Some apparently
diverse matches are due to misclassifications in the EMBL taxonomy: one
match is classed by EMBL and RFamSeq10 as being a hepatitis A virus, but is
clearly an HBV; one is classed as a rock squirrel
Figure 4. Similarity between the covariance models. The models were compared using
CMCompare. Higher scores indicate greater similarity. The avian model AHBV ε (left)
is a combination of the heron (HHBV epsilon) and duck (DHBV epsilon) models (right).
The maximum similarity score a model could have is that with itself; a score greater than 20 is likely
significant. The next most similar model had a score of 26 (see Results).
genome, but is a rock squirrel HBV. There is also a match to a synthetic
construct containing human HBV. The next best match in RFamSeq10 is not
significant, indicating that good matches to this element are not found in
cellular genomes (e.g., the human host genome). Although portions of HBV
can be integrated into the human genome, it is not unexpected that the
reference human genome does not contain an ε-like sequence. Similar results
were obtained from the GenBank viral division and RefSeq viruses.

The DHBV ε, HHBV ε and combined AHBV epsilon models all match the same six
avian hepadnaviruses in RefSeq viruses (Scores: 53-83, E < 10⁻⁹), with 54
hits in RFamSeq10 and 58 hits in the viral division of GenBank. Indeed, the
combined avian model identifies the same set with better scores
(Score > 40, E < 5 × 10⁻³). This combined model constitutes Rfam model
RF01313. The separate DHBV and HHBV epsilon models presented here are also
available (Supplement). Notably, the combined model recognizes divergent
viruses (e.g., stork HBV sequences, AJ251937). 25

The next best matches in viral genomes are marginal matches to long
bacteriophage DNA genomes (DHBV, NC_015289, Score: 24, E = 0.43; HHBV,
NC_012697, Score: 22, E = 0.92). The matched regions encode bacteriophage
proteins and do not appear to be biologically significant.

There were no significant matches to other retro-transcribing viruses (best
Score: 5.0, E = 1.6). There were also no significant matches to HDV. This
is not unexpected, as HDV is dependent on HBV only for envelopment and not
for encapsidation.

Discussion 

We have shown that the RNA families of ε replication elements proposed here
comprise a clan with both RNA structural and functional similarities. The
hepadnavirus ε plays several key roles in the viral lifecycle and has a
similar role in each family, although the elements are sufficiently
distinct to comprise two or more separate families. Although we tested
three separate models, we found that one human and one avian model provided
the best discrimination between matches and false positives. This is
consistent with experimental data showing that the upper stem, in which
base pairing may differ between avian viruses, is very tolerant to
variation. 25

This work therefore also suggests that base pairing in the upper stem is
not essential.

Although the ε replication element has some similar functions to
replication elements from other retro-transcribing and RNA viruses, we
could detect no significant similarity beyond that expected of a stem-loop
structure. This was determined by directly comparing the models to one
another. This is consistent with published functional studies where viral
or host proteins are specific for the replication elements in the
corresponding viruses. 35

Searches of known viral and non-viral 
sequences showed that the models can 
specifically detect the elements in the 
context of a viral genome within large 
databases of sequences. They revealed no 
significant matches to the mammalian 
host genomes, although the genomes of 
duck or other infected birds are not yet 
available. This type of search with a 
covariance model is more tolerant of 
substitutions/covariations within the struc- 
ture than traditional blastn searches. 
Therefore this analysis supports the idea 
that this element is a virus specific target. 

The models proposed here are specific for the hepadnaviruses. These models
add to the basis for further research into the specific bases and
structures required for the ε functions in the viral lifecycle and also
into potential antiviral strategies targeting these elements. Indeed, RNAi
strategies against the conserved HBV ε region have been effective despite
its secondary structure. This structure was expected to reduce the ability
of siRNAs or synthetic miRNAs to target this region. 36 In addition, the
protein/RNA interaction in replication and the initiation of replication
has been a target of anti-HBV drugs. 37

Materials and Methods 

Sequences were extracted representing the diversity of mammalian and avian
hepadnavirus genomes. The principal features of the structures in different
functional states were extracted from the literature. For HBV there are
over 6000 sequences in public databases, but many are identical in this
region. The sequences chosen were from a previously published HBV reference
set to represent common diversity and are similar to other published
genotyping sets for HBV. The sequences of other orthohepadnaviruses were
also compared (NC_001484, NC_004107). They have an inserted C relative to
HBV at position 5 (UGUUCCA) and several compensating changes that are also
found in HBV genomes.

For HHBV and DHBV, fewer se- 
quences are available. There are other 
members of the avian hepadnaviruses 
but there is currently too little diversity 
in the data and insufficient experimental 
evidence to form an alignment from 
which to build separate models. 

Alignments were done manually using AquaEmacs in Ralee mode, guided by
structural probing and NMR studies (PDB: 2OJ7, 2OJ8, 2K5Z, 2IXY, 2IXZ) and
considering the modeling of the lower stem done by other groups. 38 These
structures were determined by chemical and enzymatic probing and also NMR
analyses of HBV ε, DHBV ε and HHBV ε. 13,15,23,30,31,39-41 Compatibility
with minimum free energy structures was ascertained by folding individual
sequences. Covariance models were built from alignments using cmbuild 1.0.2
and calibrated using cmcalibrate. Comparison of models was done with
CMCompare. 32 The HBV_epsilon and AHBV_epsilon models have been submitted
to Rfam as RF01407 and RF01313, and comprise Rfam clan: CLN00104.
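
The build, calibrate and search steps can be summarized as an ordered list of command lines, sketched below in Python. The file names are hypothetical, no options are shown, and the positional argument order (model file before alignment or sequence file) is assumed from the Infernal documentation and should be checked against the installed version. The sketch only assembles and prints the commands; it does not run them.

```python
import shlex


def infernal_commands(alignment, model, database):
    """Return argv lists for building, calibrating and searching with one model.

    The argument order is an assumption based on the Infernal user guide:
    cmbuild <cmfile> <alignment>, cmcalibrate <cmfile>,
    cmsearch <cmfile> <sequence file>.
    """
    return [
        ["cmbuild", model, alignment],   # build the model from a Stockholm alignment
        ["cmcalibrate", model],          # calibrate the model so E values can be reported
        ["cmsearch", model, database],   # search a sequence database with the model
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for argv in infernal_commands("HBV_epsilon.sto", "HBV_epsilon.cm", "viral_genomes.fa"):
        print(shlex.join(argv))
```

Keeping the commands as argument lists rather than shell strings makes it straightforward to pass them to a process runner later without quoting problems.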

Data sets analyzed. Sequences were searched against each calibrated model
using cmsearch (for RefSeq) or Rfam_scan followed by cmsearch for larger
databases. 42 Three data sets were used: (1) RefSeq 47 (12/5/2011) viral
genomes, a curated set of viral genomes with limited redundancy; (2) the
viral division of GenBank (4/7/2011); and (3) the most recent RFAMSEQ,
version 10, derived by Rfam from EMBL 100 (29/5/2009).